Abstract

Summary: 

Protein Fragment Motif Finder (PFMFind) is a system that enables efficient discovery of relationships between short 
fragments of protein sequences using similarity search. It supports queries based on score matrices and PSSMs obtained 
through an iterative procedure similar to PSI-BLAST. PSSM construction is customisable through plugins written in 
Python. PFMFind consists of a GUI client, an algorithm using an index for fast similarity search and a relational database 
for storing search results and sequence annotations. It is written mostly in Python. All components communicate 
between themselves using TCP/IP sockets and can be located on different physical machines. PFMFind is available for 
UNIX and Windows platforms. 

Availability: 

PFMFind is freely available (under a GPL licence) for download from the web site of the Centre for Biodiscovery,
Victoria University of Wellington, http://www.vuw.ac.nz/biodiscovery/publications/centre/pfmfind.aspx

Contact:

astojmir@uottawa.ca
Introduction

The biological functions of proteins are as much a func- 
tion of particular motifs of peptide sequence as they are 
of the overall protein structure. It is of interest to the 
biologist to search for examples of convergent motifs 
as they are likely to indicate a functional role. While 
many approaches exist for finding longer sequence motifs 
(50 amino acids or more), finding relationships between 
short fragments (3-18 amino acids long) of full protein 
sequences also promises great rewards in understanding 
novel aspects of protein structure and function. These re- 
lationships might be evolutionary in origin or might arise 
by convergence, that is, by acquisition of the same biolog- 
ical function in evolutionarily distant species. 

Finding short motifs presents significant challenges be- 
cause many of the apparent relationships between short 
fragments could have arisen by chance and thus have no 
functional significance. Furthermore, most widely available
tools for sequence database search and motif finding
were designed with longer motifs in mind. For example,
Watt and Doyle recently observed that the NCBI BLAST
family of programs, the best known set
of tools for searching biological sequence datasets, is not 
suitable for identifying shorter sequences with particular 
constraints and proposed a pattern search tool to find DNA 
or protein fragments that match a given sequence or a pat- 
tern exactly. This paper outlines the Protein Fragment 
Motif Finder (PFMFind), a new tool that uses database
search to identify the conserved short peptide motifs of 
a query sequence and associates them with the available 
functional annotations. 



Overview 

The PFMFind system consists of three major compo- 
nents: a search engine for fast similarity search of datasets 
of short peptide fragments called FSIndex, a relational 
database, and the PFMFind GUI (graphical user interface) 
client (Figure 1). The PFMFind client takes user input and
communicates with FSIndex and the database through its 
components. It passes search parameters in batches to 
FSIndex and receives the results of searches that are then 
stored in the database. It also retrieves the results from 
the database and displays them, together with available 
annotations, to the user. The annotations are stored in a
separate (BioSQL) schema in the database.

Figure 1: Structure of the PFMFind system.

Most of PFMFind was written in the Python pro- 
gramming language, and uses both the standard Python 
library and additional modules such as Biopython
(http://www.biopython.org). The components communicate
using the standard TCP/IP socket interface and
can therefore be located on different machines. Since 
PFMFind is highly modular, the GUI client can be re- 
placed by a Python script for non-interactive use. 

Similarity Search 

PFMFind supports searches of datasets of short peptide 
fragments of fixed length using an ungapped similarity 
score obtained by summing similarity scores at each po- 
sition of the fragments being compared. The positional 
similarity scores can be defined by standard score matrices
such as PAM or BLOSUM, or by PSSMs (position-specific
score matrices). A dataset consists of
all fragments of a specified length from a given protein 
sequence dataset (where the fragments may overlap). 
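The fragment dataset and the ungapped score described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy scoring function are assumptions made here, not part of PFMFind, and a real search would use a PAM or BLOSUM matrix or a PSSM instead.

```python
def fragments(sequence, length):
    """Return every overlapping fragment of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def toy_score(a, b):
    # Hypothetical positional score: 2 for identical residues, -1 otherwise.
    return 2 if a == b else -1


def matrix_score(query, fragment, score=toy_score):
    """Ungapped similarity: sum of positional scores of two equal-length fragments."""
    if len(query) != len(fragment):
        raise ValueError("fragments must have the same length")
    return sum(score(q, f) for q, f in zip(query, fragment))


def pssm_score(pssm, fragment):
    """Score a fragment against a PSSM stored as one {residue: score} dict per position."""
    if len(pssm) != len(fragment):
        raise ValueError("PSSM and fragment must have the same length")
    return sum(column[residue] for column, residue in zip(pssm, fragment))


dataset = fragments("MKTAYIAK", 3)
```

Because the score is a plain sum over positions, a score matrix and a PSSM differ only in whether the positional score depends on the query residue or on the query position.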

Iterative construction of PSSMs, similar to that used by
PSI-BLAST, is supported through plugins: Python routines
that take the results of a previous search and construct
a PSSM. The default plugin uses the weighting procedure
of Henikoff and Henikoff to assign weights to fragments
and Dirichlet mixtures for regularising the amino acid
frequency counts at each position. Users with some
knowledge of Python can create their own plugins and use
them for searches by placing them in the appropriate
directory.
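As a rough illustration of what such a plugin might compute, the sketch below derives position-based sequence weights in the manner of Henikoff and Henikoff and turns the weighted counts into a log-odds PSSM. The function names and signatures are hypothetical and do not reflect the actual PFMFind plugin interface. A simple background-proportional pseudocount is used here in place of Dirichlet mixtures, purely to keep the example short.

```python
import math
from collections import Counter


def henikoff_weights(hits):
    """Position-based sequence weights, normalised to sum to 1."""
    weights = [0.0] * len(hits)
    for column in zip(*hits):
        counts = Counter(column)
        distinct = len(counts)
        for i, residue in enumerate(column):
            # A residue shared by many hits in this column contributes less.
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def build_pssm(hits, background, pseudocount=1.0):
    """Hypothetical plugin body: hits of the previous search -> log-odds PSSM.

    Assumption: a pseudocount proportional to the background frequencies
    stands in for the Dirichlet mixture regulariser.
    """
    weights = henikoff_weights(hits)
    n = len(hits)
    pssm = []
    for column in zip(*hits):
        counts = dict.fromkeys(background, 0.0)
        for weight, residue in zip(weights, column):
            counts[residue] += weight * n  # weighted residue counts
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount * bg) / (n + pseudocount) / bg)
            for aa, bg in background.items()
        })
    return pssm
```

Every residue appearing in the hits must have an entry in `background`; the sketch does not handle unknown residues.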

Search criteria can be specified according to cutoff raw 
similarity scores, distances, p-values and E-values, as 
well as the number of closest datapoints to retrieve. The 
probability model for calculation of p-values assumes that
the score of each fragment is the sum of independent random
variables corresponding to the scores at each position;
the score distribution is calculated using the fast Fourier
transform (FFT).
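Under this independence assumption, the distribution of the total score is the convolution of the per-position score distributions. The sketch below computes that convolution directly rather than with an FFT, which gives the same distribution but is slower for wide score ranges. It assumes integer scores and a PSSM stored as one dictionary per position. The function names are illustrative only.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Distribution of the total score of a random fragment.

    pssm: list of {residue: integer score} dicts, one per position.
    background: {residue: probability} used independently at each position.
    """
    dist = {0: 1.0}
    for column in pssm:
        updated = defaultdict(float)
        for partial, p in dist.items():
            for residue, bg in background.items():
                # Convolve the running distribution with this position's scores.
                updated[partial + column[residue]] += p * bg
        dist = dict(updated)
    return dist


def p_value(dist, score):
    """Probability that a random fragment scores at least `score`."""
    return sum(p for s, p in dist.items() if s >= score)
```

An E-value would then follow by multiplying the p-value by the number of fragments in the dataset, assuming the usual definition of expected number of chance hits.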

FSIndex 

The heart of PFMFind is FSIndex, an efficient index- 
ing scheme for similarity search of very large datasets 
of short protein fragments of fixed length. FSIndex
is based on two principles: reduction of the amino
acid alphabet to clusters largely based on their biochem- 
ical properties (hydrophobic, polar, charged, aromatic ...) 
and combinatorial generation of neighbours. The design 
of FSIndex means that a typical search scans less than
1% of the fragment dataset while guaranteeing that no
neighbours satisfying the search criteria are missed.
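The two principles can be illustrated with a toy index. This is not the FSIndex implementation: the two-cluster alphabet, the scoring function and all names below are assumptions chosen for brevity. Fragments are binned by their reduced-alphabet key, and a query enumerates only those keys whose best possible score can still reach the cutoff, so no qualifying fragment is skipped.

```python
from collections import defaultdict

CLUSTERS = ["IL", "KR"]  # hypothetical reduced alphabet (hydrophobic, charged)
CLUSTER_OF = {aa: c for c, group in enumerate(CLUSTERS) for aa in group}


def toy_score(a, b):
    # Hypothetical scores: identity 4, same cluster 2, different cluster -2.
    if a == b:
        return 4
    return 2 if CLUSTER_OF[a] == CLUSTER_OF[b] else -2


def fragment_score(query, fragment):
    return sum(toy_score(q, f) for q, f in zip(query, fragment))


def build_index(fragments):
    """Bin fragments by the tuple of cluster indices of their residues."""
    index = defaultdict(list)
    for fragment in fragments:
        index[tuple(CLUSTER_OF[aa] for aa in fragment)].append(fragment)
    return index


def search(index, query, cutoff):
    """Return (hits, number of fragments scanned) for score >= cutoff."""
    # Best score the query residue can reach against any member of each cluster.
    bounds = [[max(toy_score(q, aa) for aa in group) for group in CLUSTERS]
              for q in query]
    best_rest = [0] * (len(query) + 1)
    for pos in range(len(query) - 1, -1, -1):
        best_rest[pos] = best_rest[pos + 1] + max(bounds[pos])
    hits, scanned = [], 0

    def visit(pos, key, bound):
        nonlocal scanned
        if pos == len(query):
            for fragment in index.get(key, ()):
                scanned += 1
                score = fragment_score(query, fragment)
                if score >= cutoff:
                    hits.append((fragment, score))
            return
        for cluster, cluster_bound in enumerate(bounds[pos]):
            # Prune keys that cannot reach the cutoff even in the best case.
            if bound + cluster_bound + best_rest[pos + 1] >= cutoff:
                visit(pos + 1, key + (cluster,), bound + cluster_bound)

    visit(0, (), 0)
    return hits, scanned
```

Pruning uses an upper bound on the score, so it can discard whole bins without ever discarding a true neighbour; the fraction of the dataset scanned depends on the cutoff and on how the alphabet is clustered.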

FSIndex is implemented in the C programming lan- 
guage and embedded into Python, with the whole data 
structure as well as the indexed sequences stored in pri- 
mary memory. For even greater efficiency, computation 
of searches can be distributed among several machines 
using a master/slave model: the master handles p-value
computations, distributes queries to the slaves (each of which
indexes a different part of the dataset), and communicates
with the client.
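The division of labour can be mimicked in a single process, as in the sketch below. Threads stand in for the separate machines, a linear scan stands in for each slave's index, and the scoring function and all names are hypothetical. PFMFind itself uses TCP/IP sockets between components, which this example does not attempt to reproduce.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(query, fragment):
    # Hypothetical score: number of identical positions.
    return sum(1 for q, f in zip(query, fragment) if q == f)


def worker_search(shard, query, cutoff):
    """One worker searches only its own part of the dataset."""
    scored = ((fragment, toy_score(query, fragment)) for fragment in shard)
    return [(fragment, score) for fragment, score in scored if score >= cutoff]


def master_search(shards, query, cutoff):
    """Send the query to every worker and merge the partial hit lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda shard: worker_search(shard, query, cutoff), shards)
        merged = [hit for hits in partial for hit in hits]
    # Best score first, ties broken alphabetically for a stable result.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

The sketch assumes at least one shard; merging is simple because each fragment lives in exactly one shard.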

Database 

The second major component of PFMFind is a relational 
database, used both for storage of search results and the 
sequence annotation. We use PostgreSQL, a freely avail- 
able modern database management system. 

Each user of the system has their own schema for storing
search results. The database also stores all search parameters,
including PSSMs and the results of each iteration,
facilitating reversion to a previous iteration without
repeating the whole procedure.

The database stores sequence annotations in a stan- 
dard BioSQL schema available to all users. PFMFind 
also contains scripts for loading four types of information
beyond the basic sequence information: UniProt keywords
and features, UniRef clusters, and InterPro domains.
When retrieved for display, annotations are
joined to search results through accession numbers. 
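The join through accession numbers can be sketched with a small in-memory database. SQLite is used here only so that the example is self-contained; PFMFind uses PostgreSQL. The table layout, column names and accession values are invented for illustration and do not correspond to the BioSQL schema or to real sequences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hits (iteration INTEGER, query_pos INTEGER,
                   accession TEXT, score INTEGER);
CREATE TABLE annotations (accession TEXT, kind TEXT, value TEXT);
""")
# Made-up rows: hits of two iterations and annotations of one sequence.
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", [
    (1, 0, "ACC0001", 12),
    (1, 0, "ACC0002", 9),
    (2, 0, "ACC0001", 15),
])
conn.executemany("INSERT INTO annotations VALUES (?, ?, ?)", [
    ("ACC0001", "keyword", "example keyword"),
    ("ACC0001", "domain", "example domain"),
])


def annotated_hits(connection, iteration):
    """Hits of one stored iteration, joined to annotations by accession."""
    return connection.execute(
        """
        SELECT h.query_pos, h.accession, h.score, a.kind, a.value
        FROM hits AS h
        LEFT JOIN annotations AS a ON a.accession = h.accession
        WHERE h.iteration = ?
        ORDER BY h.query_pos, h.accession, a.kind
        """,
        (iteration,),
    ).fetchall()
```

Keeping the iteration number with every hit is what makes it possible to return to an earlier iteration by querying it again instead of repeating the search. A left join keeps hits that have no annotations.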



GUI Client 

The final PFMFind component is a GUI client that con- 
nects to both the FSIndex master and the database com- 
ponent. To perform fragment searches, the user specifies 
a query sequence, usually a long sequence that is broken 
into overlapping fragments of fixed length, and chooses 
the fragment lengths, cutoff parameters and the actual 
fragments in the query sequences that will be used for the 
search. 

The GUI client can display search results both as lists 
of hits associated with a particular location in the query 
sequence and as a feature vs location dot plot — each 
location matching a particular feature is marked by a 
coloured dot. Dots are colour coded by the number of 
hits matching the feature to distinguish frequently repre- 
sented features from those that appear only a few times in 
the hit list. All computations of PSSMs are performed by 
the GUI client as well. 
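The data behind such a dot plot amounts to a count for every (feature, location) pair. The sketch below shows one way to compute it. The input layout, the function names and the colour thresholds are assumptions made for illustration and are not taken from the PFMFind client.

```python
from collections import Counter


def dot_plot_counts(hits_by_location, features_by_accession):
    """Count, for each (feature, location), the hits carrying that feature.

    hits_by_location: {query location: [accession of each hit]}
    features_by_accession: {accession: [feature names]}
    """
    counts = Counter()
    for location, accessions in hits_by_location.items():
        for accession in accessions:
            for feature in features_by_accession.get(accession, ()):
                counts[(feature, location)] += 1
    return counts


def colour_bin(count, thresholds=(1, 5, 20)):
    """Index of the colour class: the number of thresholds the count reaches."""
    return sum(count >= t for t in thresholds)
```

A plotting layer would then draw one dot per key of the counter and pick its colour from `colour_bin`.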



Conclusion 

PFMFind is an efficient, flexible, and extensible frame- 
work for similarity search of datasets of short peptide 
fragments. It supports fast similarity search with selec- 
tivity and sensitivity specified by PSSMs and associates 
search results with biological function by using sequence 
features and annotations. We shall describe our use of 
PFMFind to search for functions associated with short 
fragments that could have arisen by convergence in an- 
other publication. 